Thesis

* Over the last 10 years R&D investments have made high
performance embedded computing for national security
applications more like mainstream high performance
computing

- DARPA Touchstone

- DARPA Embedded Systems

- OSD High Performance Embedded Computing Software
Initiative

* Over the next 10 years R&D investments will make
mainstream high performance computing for national
security applications more like high performance
embedded computing

- DARPA Adaptive Computing

- DARPA Data Intensive Systems

- DARPA Polymorphic Computing Architectures

- DARPA High Productivity Computing Systems
Thesis

- OSD High Performance Embedded Computing Software
Initiative

- DARPA High Productivity Computing Systems
HPEC Software Initiative

Program Goals

* Develop and integrate software
technologies for embedded
parallel systems to address
portability, productivity, and
performance

* Engage acquisition community to
promote technology insertion

* Deliver quantifiable benefits

Portability: reduction in lines of code to change to port/scale to a new system
Productivity: reduction in overall lines of code
Performance: computation and communication benchmarks
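
The portability and productivity metrics above are both ratios of line counts. A minimal sketch of how such a reduction factor could be computed follows; the function name and the line counts are hypothetical and are not taken from the program.

```python
def reduction_factor(baseline_loc, new_loc):
    """Return how many times fewer lines are needed relative to a baseline.

    For portability, pass the lines changed to port/scale to a new system.
    For productivity, pass the overall lines of code.
    """
    if baseline_loc < 0 or new_loc <= 0:
        raise ValueError("line counts must be positive")
    return baseline_loc / new_loc


# Hypothetical counts, for illustration only.
lines_changed_before = 3000   # lines touched to port the baseline code
lines_changed_after = 1000    # lines touched to port the middleware-based code
print(f"portability gain: {reduction_factor(lines_changed_before, lines_changed_after):.1f}x")
```

A factor above 1 means fewer lines had to be written or changed than in the baseline.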


Performance  (1.5x) 


AFRL 
MIT Lincoln Laboratory


MITRE


Each server or multicomputer
forms an independent signal
processor. The I/O server
round-robins data to each signal
processor to meet throughput
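
The round-robin scheme described here can be sketched as follows. This is an illustration only: the function names, the frame abstraction, and the rates are assumptions, not part of the CIP software.

```python
import math


def round_robin(frames, num_processors):
    """Deal incoming frames to independent signal processors in turn."""
    queues = [[] for _ in range(num_processors)]
    for index, frame in enumerate(frames):
        # Frame k goes to processor k mod P, so the load is spread evenly.
        queues[index % num_processors].append(frame)
    return queues


def processors_needed(input_rate, per_processor_rate):
    """Smallest number of processors whose combined rate covers the input."""
    return math.ceil(input_rate / per_processor_rate)


# Hypothetical example: 7 frames dealt to 3 signal processors.
for proc_id, queue in enumerate(round_robin(range(7), 3)):
    print(f"processor {proc_id}: frames {queue}")
```

Because each processor works on whole frames independently, throughput scales with the number of processors while the latency of a single frame stays the same.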


Software  Middleware 


Standards 


MPI 

VSIPL 

DRI 


Common Imagery Processor: Experiment Overview


Distributed  Memory  Multicomputers 


Distributed-memory CIP software would allow embedded multicomputers
or commodity clusters to be inserted as signal processors in the CIP system


HPEC-SI Middleware


Development


Applied Research


VSIPL++


Parallel VSIPL++


- MAPPING (data parallelism)
- Early binding (computations)
- Compatibility (backward/forward)
- Local Knowledge (accessing local data)
- Extensibility (adding new functions)
- Remote Procedure Calls (CORBA)
- C++ Compiler Support
- Test Suite
- Adoption Incentives (vendor, integrator)


- MAPPING (task/pipeline parallel)
- Reconfiguration (for fault tolerance)
- Threads
- Reliability/Availability
- Data Permutation (DRI functionality)
- Tools (profiles, timers, ...)
- Quality of Service


From the Small to the Big


Earth  Simulator 


Earth  Simulator  pictures  from  www.es.jamstec.go.jp/esc/eng 
JSTARS  pictures  courtesy  of  Northrop  Grumman 
Efficiency: The Big Picture


Losses  accumulate  from  the  point  of  electricity  generation, 
through  distribution,  and  finally  during  utilization  by  the  end  user 


Efficiency  of  Electrical  Power  Generation 


[Chart: generation efficiency for four coal-fired and two gas-fired plant types, including CCGT]


www.electricity.org.uk/uk_inds/environ/env_19.html 

www.parcon.uci.edu/paper/energy.htm 


One Ton of Coal Generated

Year   Energy
1891   150 kWh
1914   550 kWh
1920   630 kWh
1939   1566 kWh
2002   3000 kWh

Electricity  distribution 
efficiency:  92% 


ASCI Q: 24 - 30 Tflop/s (peak)
3 megawatts to run plus
2 megawatts to cool
(energy for 5000 homes)


Computing efficiency

Gflop/s
Percent peak
Gflop/s/Watt
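
The figures on this slide can be combined into a rough Gflop/s/Watt number and a coal consumption rate. The sketch below uses only the slide's numbers; it assumes that the 2002 figure of 3000 kWh per ton is measured at the plant, before the 92% distribution efficiency is applied, and that cooling power counts toward the total.

```python
# Figures quoted on the slide.
ASCI_Q_PEAK_TFLOPS = (24.0, 30.0)   # peak range, Tflop/s
RUN_MW = 3.0                        # megawatts to run
COOL_MW = 2.0                       # megawatts to cool
DISTRIBUTION_EFFICIENCY = 0.92
KWH_PER_TON_COAL_2002 = 3000.0


def gflops_per_watt(peak_tflops, total_mw):
    """Peak Gflop/s delivered per watt of total (run + cool) power."""
    return (peak_tflops * 1e3) / (total_mw * 1e6)


def coal_tons_per_hour(load_mw, kwh_per_ton, distribution_efficiency):
    """Tons of coal burned per hour to supply a load.

    Assumption: kwh_per_ton is energy at the plant, so distribution
    losses increase the energy the plant has to generate.
    """
    kwh_at_load = load_mw * 1e3            # energy used in one hour
    kwh_at_plant = kwh_at_load / distribution_efficiency
    return kwh_at_plant / kwh_per_ton


total_mw = RUN_MW + COOL_MW
low, high = (gflops_per_watt(t, total_mw) for t in ASCI_Q_PEAK_TFLOPS)
tons = coal_tons_per_hour(total_mw, KWH_PER_TON_COAL_2002, DISTRIBUTION_EFFICIENCY)
print(f"peak efficiency: {low:.4f} to {high:.4f} Gflop/s/Watt")
print(f"coal burned: {tons:.2f} tons per hour")
```

These are peak rates; sustained Gflop/s/Watt is lower by the fraction of peak an application actually achieves.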

MITRE
High Productivity
Computing Systems


Goals:

> Provide a new generation of economically viable high productivity computing systems for the
national security and industrial user community (2007 - 2010)


Impact:

• Performance (efficiency): improve critical national security
applications by a factor of 10X to 40X

• Productivity (time-to-solution)

• Portability (transparency): insulate research and
operational application software from system

• Robustness (reliability): apply all known techniques
to protect against outside attacks, hardware faults,
& programming errors
HPCS Program Focus Areas


• Intelligence/surveillance, reconnaissance, cryptanalysis, weapons analysis, airborne contaminant
modeling and biotechnology


Fill the Critical Technology and Capability Gap
Today (late 80's HPC technology) ... to ... Future (Quantum/Bio Computing)
HPCS Program Phases 1-3

[Figure: program timeline by fiscal year (02, 03, ...) showing critical program milestones. Labels recoverable from the figure:]

- Phases: Phase 1 Industry Concept Study; Phase 2 R&D; Phase 3 Full Scale Development (Planned)
- Reviews and milestones: Concept Reviews; Readiness Reviews; Phase 2 Reviews; System Design Review; PDR; Industry BAA/RFP
- Activities: Application Analysis; Performance Assessment; Industry Guided Research; Requirements and Metrics; Metrics and Benchmarks; Technology Assessments
- Products: Early Software Tools; Academia Research Platforms; Early Pilot Platforms; Early Prototype; Research Prototypes & Pilot Systems; HPCS Capability or Products
Bounding the HPCS Challenges
Why applications with limited memory
reuse perform inefficiently today

STREAMS ADD: Computes A + B for long vectors A and B

[Chart axes: Year of Introduction (Cray); Clock Speed (MHz)]

• New microprocessor generations "reset" performance to around 6% of peak

• Performance degrades to 1% - 3% of peak as clock speed increases
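
One way to see why the percentage falls as clock speed rises is a simple bandwidth-bound model of the ADD kernel. The sketch below is a model, not the benchmark itself. It assumes 8-byte operands, two loads and one store per addition, and no cache reuse. The bandwidth and peak figures are hypothetical.

```python
BYTES_PER_DOUBLE = 8


def stream_add(a, b):
    """Reference ADD kernel: one floating-point add per element, no reuse."""
    return [x + y for x, y in zip(a, b)]


def add_sustained_flops(bandwidth_bytes_per_s):
    """Flop/s the ADD kernel can sustain when memory bandwidth is the limit.

    Each result needs two loads and one store, so 24 bytes move per flop.
    """
    bytes_per_flop = 3 * BYTES_PER_DOUBLE
    return bandwidth_bytes_per_s / bytes_per_flop


def percent_of_peak(bandwidth_bytes_per_s, peak_flops):
    return 100.0 * add_sustained_flops(bandwidth_bytes_per_s) / peak_flops


# Hypothetical machine: 2.4 GB/s memory bandwidth, 2 Gflop/s peak.
print(f"{percent_of_peak(2.4e9, 2.0e9):.1f}% of peak")
# Same memory system with twice the peak rate (for example, a faster clock).
print(f"{percent_of_peak(2.4e9, 4.0e9):.1f}% of peak")
```

In this model the sustained rate depends only on memory bandwidth, so any increase in peak rate that is not matched by bandwidth lowers the percent of peak.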
Long FFTs are Inefficient


CacheBench


Direct correlation between
memory bandwidth from
various levels of the memory
hierarchy and the performance
of real applications


Memory Hierarchy

Pentium III, ^65 133, 1 GB RAM

Red Hat 7.3, Kernel 2.4.16, Low Latency Patch Applied
FFTW Benchmark Data as %PEAK
Intel Processors
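
A percent-of-peak figure for an FFT can be derived from a measured run time and a nominal operation count. The sketch below uses the conventional 5 N log2 N estimate for a complex transform of length N. The timing and peak rate are hypothetical and are not data from this slide.

```python
import math


def fft_flops(n):
    """Nominal flop count for a complex FFT of length n (5 n log2 n)."""
    return 5.0 * n * math.log2(n)


def fft_percent_of_peak(n, seconds, peak_flops):
    """Percent of peak achieved by one length-n FFT that took `seconds`."""
    sustained = fft_flops(n) / seconds
    return 100.0 * sustained / peak_flops


# Hypothetical measurement: a 1024-point FFT in 100 microseconds
# on a processor with a 1 Gflop/s peak rate.
print(f"{fft_percent_of_peak(1024, 1e-4, 1e9):.1f}% of peak")
```

The nominal count is a normalization rather than an exact operation count, so it is best used to compare transform sizes and machines on a common scale.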
HPCS Challenge
Cost effective signal
processing in software
for high throughput
streaming applications


For the CacheBench benchmark see icl.cs.utk.edu/projects/llcbench/cachebench.html (Phil Mucci)
For FFTW software see www.fftw.org (Matteo Frigo & Steven G. Johnson)
Value Proposition: Metrics


Producer

* Sells computers
* Sells support
* Profit
* Market share
* Stockholder's equity
* Reputation
* Peak rates
* Customer satisfaction
* Deliver solutions
* Novel technology


Consumer

* Has national security mission
* Needs a computer to process data or calculate answers
* In time: time to solution
* Fits (size, weight, power)
* Easy to program: idea to solution
* Affordable: life-cycle, facilities and support costs
* Efficient: sustained rates
* Reliable
* Evolvable
Thesis

* Over the last 10 years R&D investments have made high
performance embedded computing for national security
applications more like mainstream high performance
computing

- Looking for good HPEC-SI demonstrations


* Over the next 10 years R&D investments will make
mainstream high performance computing for national
security applications more like high performance
embedded computing

- Looking for good HPCS challenge problems

Richard Games, rg@mitre.org